Pass down storage options #5673

dwyatte · 2023-03-26T20:09:37Z

Remove implementation-specific kwargs from file_utils.fsspec_get and file_utils.fsspec_head, and instead allow them to be passed down via storage_options. This fixes an issue where s3fs did not recognize a timeout arg. It also addresses an issue mentioned in #5281 by letting users pass storage_options all the way down from datasets.load_dataset to support implementation-specific credentials.

Supports something like the following to provide credentials explicitly instead of relying on boto's methods of locating them:

load_dataset(..., data_files=["s3://..."], storage_options={"profile": "..."})
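To make the first fix concrete, here is a minimal sketch of why a hardcoded kwarg breaks some backends while a caller-supplied storage_options dict does not. Everything in it is hypothetical: FakeHTTPFileSystem, FakeS3FileSystem, REGISTRY, get_hardcoded, get_with_storage_options, and the example URLs are stand-ins for illustration, not the actual fsspec, s3fs, or datasets code.

```python
# Hypothetical stand-ins for filesystem implementations. The real s3fs and
# HTTP filesystems accept different (and more) constructor arguments.
class FakeHTTPFileSystem:
    def __init__(self, timeout=None):
        self.timeout = timeout


class FakeS3FileSystem:
    def __init__(self, profile=None):
        self.profile = profile


REGISTRY = {"http": FakeHTTPFileSystem, "s3": FakeS3FileSystem}


def get_hardcoded(url, timeout=10.0):
    """Old pattern: one implementation-specific kwarg is forced on every backend."""
    protocol = url.split("://", 1)[0]
    return REGISTRY[protocol](timeout=timeout)


def get_with_storage_options(url, storage_options=None):
    """New pattern: the caller decides which kwargs the backend receives."""
    protocol = url.split("://", 1)[0]
    return REGISTRY[protocol](**(storage_options or {}))


# The hardcoded kwarg works for one backend but is rejected by the other.
http_fs = get_hardcoded("http://example.com/data.csv")
try:
    get_hardcoded("s3://bucket/data.csv")
    hardcoded_s3_failed = False
except TypeError:
    hardcoded_s3_failed = True

# With storage_options, each backend only sees arguments it understands.
s3_fs = get_with_storage_options("s3://bucket/data.csv", {"profile": "my-profile"})
```

The design choice is that the generic helper stays unaware of backend-specific arguments, so adding a new backend does not require touching it.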

HuggingFaceDocBuilderDev · 2023-03-26T20:17:05Z

The documentation is not available anymore as the PR was closed or merged.

lhoestq

Awesome! Just one comment:

download_and_prepare is not called when streaming a dataset, so we may need to have storage_options in DatasetBuilder.__init__? This way it could also be passed later to as_streaming_dataset and the StreamingDownloadManager.

Currently the storage_options parameter in download_and_prepare is for the target filesystem where the dataset must be downloaded and prepared as Arrow files.

src/datasets/utils/file_utils.py

dwyatte · 2023-03-27T13:36:39Z

Ah, I noted this when looking for ways to plumb down storage_options, although I think I was looking at adding it to BuilderConfig. The DatasetBuilder constructor looks more appropriate for this; I will get that added in a future commit.
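A minimal sketch of the distinction discussed here, assuming the source-side options are stored on the builder at construction time. SketchBuilder and SketchStreamingDownloadManager are hypothetical stand-ins, not the real DatasetBuilder or StreamingDownloadManager, and the URLs and option values are made up.

```python
class SketchStreamingDownloadManager:
    """Hypothetical stand-in for a streaming download manager."""

    def __init__(self, storage_options=None):
        self.storage_options = storage_options or {}


class SketchBuilder:
    """Hypothetical stand-in for a dataset builder."""

    def __init__(self, data_files, storage_options=None):
        self.data_files = data_files
        # Options for reading the *source* files, available to every load path.
        self.storage_options = storage_options or {}

    def download_and_prepare(self, output_dir, storage_options=None):
        # In this method, storage_options describes the *target* filesystem only.
        return {"source": self.storage_options, "target": storage_options or {}}

    def as_streaming_dataset(self):
        # Streaming never calls download_and_prepare, so the source options
        # have to come from the constructor.
        return SketchStreamingDownloadManager(storage_options=self.storage_options)


builder = SketchBuilder(["s3://bucket/train.csv"], storage_options={"profile": "my-profile"})
manager = builder.as_streaming_dataset()
prepared = builder.download_and_prepare("gs://other-bucket/out", storage_options={"token": "anon"})
```

Keeping the two dicts separate means credentials for reading the data and credentials for writing the prepared files cannot be confused with each other.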

lhoestq

Amazing! It works like a charm :)

Just before we merge, I wanted to mention this in the docstring. Let me know what you think:

src/datasets/load.py

dwyatte · 2023-03-28T13:41:04Z

Noting as experimental SGTM. The only tests I can think of adding at the moment are mocks that assert the storage options get passed all the way down, using mock.assert_called_with. If Hugging Face has some S3/GCS buckets for testing, maybe those would be better in a future PR. Let me know what you think.
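A minimal sketch of the kind of mock-based test described above, using only unittest.mock from the standard library. The functions open_remote, download, and load are hypothetical layers invented for this sketch; they are not the datasets call chain, and the real test would patch the library's own helpers.

```python
from unittest import mock


def open_remote(url, **storage_options):
    """Hypothetical lowest-level helper that would talk to the filesystem."""
    raise RuntimeError("network access is not expected in this sketch")


def download(url, storage_options=None):
    """Hypothetical middle layer that forwards the options unchanged."""
    return open_remote(url, **(storage_options or {}))


def load(data_files, storage_options=None):
    """Hypothetical top-level entry point."""
    return [download(url, storage_options=storage_options) for url in data_files]


def check_storage_options_are_passed_down():
    mocked = mock.Mock(return_value="ok")
    # Temporarily swap the lowest layer for a mock in this module's globals.
    with mock.patch.dict(globals(), {"open_remote": mocked}):
        result = load(["s3://bucket/file.csv"], storage_options={"profile": "test"})
    # The options given at the top must reach the bottom as keyword arguments.
    mocked.assert_called_with("s3://bucket/file.csv", profile="test")
    return result


result = check_storage_options_are_passed_down()
```

A test like this checks only the plumbing, not that a real backend accepts the options, which is why tests against an actual or in-memory filesystem would still be a useful follow-up.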

lhoestq · 2023-03-28T14:53:46Z

I think adding tests with the mockfs fixture will do the job. Tests and docs can be added when request_etag and is_remote_url support fsspec (right now they would fail with mockfs).

Let's see in a subsequent PR. This is exciting! :)

github-actions · 2023-03-28T15:03:38Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009217 / 0.011353 (-0.002136)	0.006275 / 0.011008 (-0.004733)	0.124361 / 0.038508 (0.085853)	0.035680 / 0.023109 (0.012570)	0.395255 / 0.275898 (0.119357)	0.426104 / 0.323480 (0.102624)	0.006822 / 0.007986 (-0.001163)	0.004467 / 0.004328 (0.000138)	0.099404 / 0.004250 (0.095153)	0.051919 / 0.037052 (0.014867)	0.388286 / 0.258489 (0.129797)	0.426361 / 0.293841 (0.132520)	0.053100 / 0.128546 (-0.075446)	0.019453 / 0.075646 (-0.056194)	0.433139 / 0.419271 (0.013867)	0.063240 / 0.043533 (0.019707)	0.381175 / 0.255139 (0.126036)	0.411686 / 0.283200 (0.128487)	0.104843 / 0.141683 (-0.036840)	1.853582 / 1.452155 (0.401427)	1.935644 / 1.492716 (0.442928)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.218969 / 0.018006 (0.200963)	0.515011 / 0.000490 (0.514522)	0.004017 / 0.000200 (0.003818)	0.000097 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028975 / 0.037411 (-0.008437)	0.125239 / 0.014526 (0.110713)	0.131371 / 0.176557 (-0.045185)	0.203864 / 0.737135 (-0.533271)	0.140784 / 0.296338 (-0.155554)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.620701 / 0.215209 (0.405492)	6.263557 / 2.077655 (4.185903)	2.510058 / 1.504120 (1.005938)	2.085892 / 1.541195 (0.544697)	2.170362 / 1.468490 (0.701872)	1.325600 / 4.584777 (-3.259177)	5.583355 / 3.745712 (1.837642)	5.092791 / 5.269862 (-0.177071)	2.814766 / 4.565676 (-1.750911)	0.153568 / 0.424275 (-0.270707)	0.014850 / 0.007607 (0.007243)	0.787011 / 0.226044 (0.560967)	7.948813 / 2.268929 (5.679885)	3.320831 / 55.444624 (-52.123793)	2.526327 / 6.876477 (-4.350150)	2.691651 / 2.142072 (0.549579)	1.521199 / 4.805227 (-3.284028)	0.269738 / 6.500664 (-6.230926)	0.082959 / 0.075469 (0.007490)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.740056 / 1.841788 (-0.101732)	17.699732 / 8.074308 (9.625424)	22.450689 / 10.191392 (12.259297)	0.229350 / 0.680424 (-0.451073)	0.027486 / 0.534201 (-0.506715)	0.536153 / 0.579283 (-0.043130)	0.608166 / 0.434364 (0.173802)	0.629144 / 0.540337 (0.088807)	0.732671 / 1.386936 (-0.654265)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010147 / 0.011353 (-0.001206)	0.006484 / 0.011008 (-0.004524)	0.098664 / 0.038508 (0.060156)	0.036400 / 0.023109 (0.013291)	0.432895 / 0.275898 (0.156997)	0.466433 / 0.323480 (0.142954)	0.008102 / 0.007986 (0.000117)	0.004554 / 0.004328 (0.000225)	0.100466 / 0.004250 (0.096216)	0.054066 / 0.037052 (0.017013)	0.439177 / 0.258489 (0.180688)	0.502907 / 0.293841 (0.209066)	0.059210 / 0.128546 (-0.069336)	0.020220 / 0.075646 (-0.055426)	0.124671 / 0.419271 (-0.294600)	0.064278 / 0.043533 (0.020746)	0.435659 / 0.255139 (0.180520)	0.459670 / 0.283200 (0.176471)	0.115574 / 0.141683 (-0.026109)	1.826360 / 1.452155 (0.374205)	1.943199 / 1.492716 (0.450483)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238463 / 0.018006 (0.220457)	0.534889 / 0.000490 (0.534400)	0.000404 / 0.000200 (0.000204)	0.000092 / 0.000054 (0.000038)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033210 / 0.037411 (-0.004201)	0.133529 / 0.014526 (0.119003)	0.143813 / 0.176557 (-0.032743)	0.213079 / 0.737135 (-0.524056)	0.148427 / 0.296338 (-0.147912)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.656819 / 0.215209 (0.441610)	6.414860 / 2.077655 (4.337205)	2.756182 / 1.504120 (1.252062)	2.405268 / 1.541195 (0.864073)	2.436418 / 1.468490 (0.967928)	1.289828 / 4.584777 (-3.294949)	5.572731 / 3.745712 (1.827018)	3.185432 / 5.269862 (-2.084429)	2.093220 / 4.565676 (-2.472457)	0.144817 / 0.424275 (-0.279458)	0.015674 / 0.007607 (0.008067)	0.801238 / 0.226044 (0.575194)	7.955925 / 2.268929 (5.686996)	3.605670 / 55.444624 (-51.838955)	2.837568 / 6.876477 (-4.038908)	2.873848 / 2.142072 (0.731775)	1.493512 / 4.805227 (-3.311715)	0.266251 / 6.500664 (-6.234413)	0.082417 / 0.075469 (0.006948)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.608685 / 1.841788 (-0.233103)	18.587875 / 8.074308 (10.513567)	21.786119 / 10.191392 (11.594727)	0.261748 / 0.680424 (-0.418675)	0.026228 / 0.534201 (-0.507973)	0.553538 / 0.579283 (-0.025745)	0.599780 / 0.434364 (0.165416)	0.665663 / 0.540337 (0.125325)	0.792785 / 1.386936 (-0.594151)

dwyatte mentioned this pull request Mar 26, 2023

Support cloud storage in load_dataset #5281

Open

pass down storage options

94bc9ef

dwyatte force-pushed the storage_options branch from 76d9ad4 to 94bc9ef March 26, 2023 20:11

lhoestq reviewed Mar 27, 2023

add storage_options to dataset builder class and plumb through load m…

a11bf3f

…ethods

lhoestq approved these changes Mar 28, 2023

dwyatte and others added 2 commits March 28, 2023 07:41

Update src/datasets/load.py

6ff6429

Co-authored-by: Quentin Lhoest

Update src/datasets/load.py

96fc3cd

Co-authored-by: Quentin Lhoest

lhoestq merged commit 1520e01 into huggingface:main Mar 28, 2023
